The "large p, small n" paradigm arises in microarray studies, where expression
levels of thousands of genes are monitored for a small number of subjects. There
has been an increasing demand for study of asymptotics for the various statistical
models and methodologies using genomic data. In this article, we focus on
one-sample and two-sample microarray experiments, where the goal is to identify
significantly differentially expressed genes. We establish uniform consistency of
certain estimators of marginal distribution functions, sample means and sample
medians under the large p, small n assumption. We also establish uniform
consistency of marginal p-values based on certain asymptotic approximations which
permit inference based on false discovery rate techniques. The effects of the
normalization process on these results are also investigated. Simulation studies
and data analyses are used to assess finite sample performance.
1. Introduction. Microarrays are capable of monitoring gene expression on
a large scale and are becoming a routine tool in biomedical research. Studies of
associations between microarray measurements and variations of phenotypes can
lead to better treatment assignment, and so there has been an increasing demand
for novel statistical tools for analyzing such data. For example, several recent
developments in microarray data analysis have involved semiparametric model
methodology. Such research includes, but is not limited to, estimation of
normalization effects with a semi-linear in-slide model (SLIM) in Fan, Peng and
Huang (2004) (FPH hereafter), estimation and inference of gene effects in Yang et
al. (2001) and Huang, Wang and Zhang (2005) (HWZ hereafter), classification of
phenotypes based on Affymetrix genechip data in Ghosh and Chinnaiyan (2004), and
survival analysis with right censored data and genomic covariates (Gui and Li,
2004).

Although statistical analysis with microarray data has been one of the most 
investigated areas, theoretical studies of asymptotic properties of different statistical 
methodologies remain rare (for important exceptions to this, see van der Laan and 
Bryan, 2001; FPH; and HWZ). The paucity of such research is partly caused by the 
abnormal type of asymptotics associated with microarrays: the dimension of the 
covariate p is usually much larger than the sample size n, i.e., the "large p, small 
n" paradigm referred to in West (2003). In this article, we focus on asymptotics for 
the simple settings of one-sample and two-sample comparisons, where the goal is to 
find genes differentially expressed for different phenotype groups. 

Consider, for example, a simple one-sample cDNA microarray study, where the
goal is to identify genes differentially expressed from zero. Note that this data
setting and the following discussions can be easily extended to incorporate
two-sample microarray studies as in Yang et al. (2001). Studies using Affymetrix
genechip data can be included in the same framework with only minor modifications.
Denote $Y_{ij}$ and $Z_{ij}$ as the background-corrected log-ratios and
log-intensities (as in HWZ), for array $i = 1, \ldots, n$ and gene
$j = 1, \ldots, p$. We consider the following simplified partial linear model for
cDNA microarray data:

(1)    $$Y_{ij} = \mu_j + h_i(Z_{ij}) + \epsilon_{ij},$$

where $\mu_j$ are the fixed gene effects, $h_i(Z_{ij})$ are the smooth
array-specific normalization effects (constrained to have mean zero within each
array) and $\epsilon_{ij}$ are mean zero (within array) random errors. The
constraints are for model identifiability. For simplicity of exposition, we have
omitted other potentially important terms in our model, such as possible print-tip
effects, and array-specific position and scale constants. We note, however, that
the theory we present in this paper can extend readily to these richer models.

Models similar to (1) have been investigated by HWZ and FPH. In HWZ, asymptotic
properties based on least squares estimation are established assuming fixed $p$
and $n \to \infty$. It is shown that $\mu_j$ and $h_i$ can both be consistently
estimated with optimal convergence rates. In FPH, partial consistency type
asymptotics are established. It is proved that when $n$ is fixed and
$p \to \infty$, $h_i$ can be consistently estimated by an estimator $\hat h_i$,
although $\mu_j$ cannot be consistently estimated. If we let
$X_{ij} = \mu_j + \epsilon_{ij}$ and $\tilde X_{ij} = Y_{ij} - \hat h_i(Z_{ij})$,
the results of FPH can be restated as
$\max_{1 \le i \le n} \max_{1 \le j \le p} |\tilde X_{ij} - X_{ij}| = o_P(1)$.
In other words, the normalization process is consistent. This permits the use of
the normalized array-specific gene effects $\tilde X_{ij}$ for inference in place
of the true array-specific gene effects $X_{ij}$. However, because $n$ is fixed,
the permissible inference tools at the gene level are restricted to exact methods,
such as permutation tests.

The goal of our paper is to study normalization and inference when the number
of arrays $n \to \infty$ slowly while the number of genes $p \gg n$. This is
essentially the same asymptotic framework considered in van der Laan and Bryan
(2001), who show that, provided the range of expression levels is bounded, the
sample means consistently estimate the mean gene effects uniformly across genes
whenever $\log p = o(n)$. We extend the results of van der Laan and Bryan (2001),
FPH and HWZ in three important ways. First, uniform consistency results are
extended to general empirical distribution functions and sample medians. Second, a
precise Brownian bridge approximation to the empirical distribution function is
developed and utilized to establish uniform validity of marginal p-values based on
approximations which are asymptotic in $n$. The statistical tests we consider for
this purpose include both one and two sample mean and median tests as well as
several other functionals of the empirical distribution function. We find that the
rate requirement is either $\log p_n = o(n^{1/2})$ or $\log p_n = o(n^{1/3})$,
depending on the choice of test statistic. Third, these results are further
extended to allow for the presence of normalization error.

An important consequence of these results is that approximate p-values based on
normalized gene expression data can be validly applied to false discovery rate
(FDR) methods (see Benjamini and Hochberg, 1995) for identifying differentially
expressed genes. We refer to this kind of asymptotic regime as "marginal
asymptotics" (see also Kosorok and Ma, 2005) because the focus of the inference is
at the marginal (gene) level, even though the results are uniformly valid over all
genes. The main conclusion of our paper is that the marginal asymptotic regime is
valid even if the number of genes increases almost exponentially relative to the
number of arrays, i.e., $\log p_n = o(n^{\alpha})$ for some $\alpha > 0$.
Qualitatively, this seems to be the correct order of asymptotics for microarray
experiments with a moderate number, say about 50, of replications. The main tools
we use to obtain these results include maximal inequalities, a specialized
Hungarian construction for the empirical distribution function, and a precise
bound on the modulus of continuity of Brownian motion.

The article is organized as follows. In sections 2-4, we investigate marginal 
asymptotics based on the true gene effects (no normalization error). Section 2 
discusses one-sample inference based on the mean and the median. Section 3 extends 
section 2 to the two-sample setting. Section 4 considers one and two sample inference 
when the statistics are distribution free. Section 5 demonstrates under reasonable 
regularity conditions that the asymptotic results obtained in sections 2-4 are not 
affected by the normalization process. Simulation studies and data analyses in 
section 6 are used to assess the finite sample performance and to demonstrate the 
practical utility of the proposed asymptotic theory. A brief discussion is given in 
section 7. Proofs are given in section 8. 

2. Marginal asymptotics for one sample studies. The results of this
section are based on the true data (without normalization error). For each
$n \ge 1$, let $X_{1(n)}, \ldots, X_{n(n)}$ be a sample of i.i.d. vectors of length
$p_n$, where the dependence within vectors is allowed to be arbitrary. Denote the
jth component of the ith vector $X_{ij(n)}$, i.e.,
$X_{i(n)} = (X_{i1(n)}, \ldots, X_{ip_n(n)})'$. Also let the marginal distribution
of $X_{ij(n)}$ be denoted $F_{j(n)}$, and let
$\hat F_{j(n)}(t) = n^{-1} \sum_{i=1}^{n} 1\{X_{ij(n)} \le t\}$, for all
$t \in \mathbb{R}$ and each $j = 1, \ldots, p_n$, where $1\{A\}$ is the indicator
of $A$. Note that $n$ can be viewed as the number of microarrays while $p_n$ can be
viewed as the number of genes. As mentioned in the introduction, our asymptotic
interest focuses on what happens when $n$ increases slowly while $p_n$ increases
rapidly.

We first establish, in section 2.1, uniform consistency of the marginal empirical
distribution function estimator and also the uniformity of a Brownian bridge
approximation to the standardized version of this estimator. These results are
then used in section 2.2 to establish uniform consistency of the marginal sample
means and uniform validity of marginal p-values based on the normal approximation
to the t-test. The results are extended in section 2.3 for inference based on the
marginal sample medians. Note that both the mean and median are functionals of the
empirical distribution function. The mean is computationally simpler, but the
median is more robust to data contamination.

2.1 Consistency of the marginal empirical distribution functions. The results of
this section will form the basis for the results presented in sections 2.2 and
2.3. The two theorems of this section, theorems 1 and 2 below, are somewhat
surprising, high dimensional extensions of two classical univariate results for
empirical distribution functions: the celebrated Dvoretzky, Kiefer and Wolfowitz
(1956) inequality as refined by Massart (1990) and the celebrated Komlos, Major and
Tusnady (1976) Hungarian construction as refined by Bretagnolle and Massart (1989).
The extensions utilize maximal inequalities based on Orlicz norms (see chapter 2.2
of van der Vaart and Wellner, 1996). For any real random variable $Y$ and any
$d \ge 1$, let $\|Y\|_{\psi_d}$ denote the Orlicz norm for
$\psi_d(x) = e^{x^d} - 1$, i.e.,
$\|Y\|_{\psi_d} = \inf\{c > 0 : E[e^{|Y|^d / c^d} - 1] \le 1\}$.
Note that these norms increase with $d$ (up to a constant depending only on $d$)
and that $\|\cdot\|_{\psi_1}$ dominates all $L_p$ norms (up to a constant depending
only on $p$). Also let $\|\cdot\|_{\infty}$ be the uniform norm.

The first theorem we present yields simultaneous consistency of all the
$\hat F_{j(n)}$'s for the corresponding $F_{j(n)}$'s:

Theorem 1 There exists a universal constant $0 < c_0 < \infty$ such that, for all
$n, p_n \ge 2$,

(2)    $$\left\| \max_{1 \le j \le p_n} \|\hat F_{j(n)} - F_{j(n)}\|_{\infty} \right\|_{\psi_2} \le c_0 \sqrt{\frac{\log p_n}{n}}.$$

In particular, if $n \to \infty$ and $\log p_n / n \to 0$, then the left-hand-side
of (2) goes to zero.
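The rate in (2) can be explored numerically. The following is a minimal sketch, not part of the original analysis: it assumes independent genes with Uniform(0,1) marginals (so that $F_{j(n)}(t) = t$), and the function names `sup_norm_deviation` and `max_deviation` are hypothetical. It reports the largest sup-norm deviation over all genes next to $\sqrt{\log p_n / n}$ so the two can be compared informally.

```python
import math
import random


def sup_norm_deviation(sample):
    """Sup-norm distance between the empirical CDF of `sample` and the Uniform(0,1) CDF."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at a jump of the empirical CDF, so it suffices to
    # compare F(x) = x with the empirical CDF just before and at each order statistic.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def max_deviation(n, p, rng):
    """Largest sup-norm deviation over p independent simulated genes on n arrays."""
    return max(sup_norm_deviation([rng.random() for _ in range(n)]) for _ in range(p))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for n, p in [(20, 200), (50, 2000), (200, 2000)]:
    dev = max_deviation(n, p, rng)
    rate = math.sqrt(math.log(p) / n)
    print(f"n={n:4d} p={p:5d} max deviation={dev:.3f} sqrt(log p / n)={rate:.3f}")
```

Independence across genes is only a convenience here; theorem 1 itself places no restriction on the dependence within vectors.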

Remark 1 One can show that the rate on the right-side of (2) is sharp, in the
sense that there exist sequences of data sets, where
$(\log p_n / n)^{-1/2} \max_{1 \le j \le p_n} \|\hat F_{j(n)} - F_{j(n)}\|_{\infty} \to c$,
in probability, as $n \to \infty$, and where $0 < c < \infty$. In particular, the
statement is true if the genes are all independent, $n, p_n \to \infty$ with
$\log p_n = o(n)$, and $c = 1/2$.

The second theorem shows that the standardized empirical processes
$\sqrt{n}(\hat F_{j(n)} - F_{j(n)})$ can be simultaneously approximated by Brownian
bridges in a manner which preserves the original dependency structure in the data.
This feature will be useful in studying FDR (see Benjamini and Hochberg, 1995)
properties later on. To this end, let $\mathcal{F}_{j(n)}$ denote the smallest
$\sigma$-field making all of $X_{1j(n)}, \ldots, X_{nj(n)}$ measurable,
$1 \le j \le p_n$. Also let $\mathcal{F}_n$ be the smallest $\sigma$-field making
all of $\mathcal{F}_{1(n)}, \ldots, \mathcal{F}_{p_n(n)}$ measurable.

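Theorem 2 There exist universal constants $0 < c_1, c_2 < \infty$ such that, for
all $n, p_n \ge 2$,

(3)    $$\left\| \max_{1 \le j \le p_n} \left\| \sqrt{n}(\hat F_{j(n)} - F_{j(n)}) - B_{j(n)}(F_{j(n)}) \right\|_{\infty} \right\|_{\psi_1} \le \frac{c_1 \log n + c_2 \log p_n}{\sqrt{n}},$$

for some stochastic processes $B_{1(n)}, \ldots, B_{p_n(n)}$ which are conditionally
independent given $\mathcal{F}_n$ and for which each $B_{j(n)}$ is a standard
Brownian bridge with conditional distribution given $\mathcal{F}_n$ depending only
on $\mathcal{F}_{j(n)}$, $1 \le j \le p_n$.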



2.2 Estimation of marginal sample means. Now we consider marginal inference
based on the marginal sample mean. For each $1 \le j \le p_n$, assume for this
section that the closure of the support of $F_{j(n)}$ is a compact interval
$[a_{j(n)}, b_{j(n)}]$ with $a_{j(n)} \ne b_{j(n)}$, and that $F_{j(n)}$ has mean
$\mu_{j(n)}$ and standard deviation $\sigma_{j(n)} > 0$. Let $\bar X_{j(n)}$ be the
sample mean of $X_{1j(n)}, \ldots, X_{nj(n)}$. The following corollary yields
simultaneous consistency of the marginal sample means:
Corollary 1 Under the conditions of theorem 1 and with the same constant $c_0$,
we have for all $n, p_n \ge 2$,

(4)    $$\left\| \max_{1 \le j \le p_n} |\bar X_{j(n)} - \mu_{j(n)}| \right\|_{\psi_2} \le c_0 \sqrt{\frac{\log p_n}{n}} \max_{1 \le j \le p_n} |b_{j(n)} - a_{j(n)}|.$$

Remark 2 Note that corollary 1 slightly extends the large p small n consistency
results of van der Laan and Bryan (2001) by allowing the range of the support to
increase with $n$ provided it does not increase too rapidly.

Now assume that we wish to test the marginal null hypothesis
$H_0^{j(n)} : \mu_{j(n)} = \mu_{0,j(n)}$ with the test statistic

$$T_{j(n)} = \frac{\sqrt{n}\,(\bar X_{j(n)} - \mu_{0,j(n)})}{\hat\sigma_{j(n)}},$$

where $\hat\sigma_{j(n)}$ is a location-invariant and consistent estimator of
$\sigma_{j(n)}$. To use FDR, we need to obtain uniformly consistent estimates of
the p-values of these tests. One way to do this is with permutation methods. A
computationally easier way is to just use $\hat\pi_{j(n)} = 2\Phi(-|T_{j(n)}|)$,
where $\Phi$ is the distribution function for the standard normal. The conclusion
of the following corollary is that this approach leads to uniformly consistent
p-values under reasonable conditions:

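Corollary 2 Let the constants $c_1, c_2$ be as in theorem 2. Then, for all
$n, p_n \ge 2$, there exist standard normal random variables
$Z_{1(n)}, \ldots, Z_{p_n(n)}$ which are conditionally independent given
$\mathcal{F}_n$ and for which each $Z_{j(n)}$ has conditional distribution given
$\mathcal{F}_n$ depending only on $\mathcal{F}_{j(n)}$, $1 \le j \le p_n$, such that

(5)    $$\max_{1 \le j \le p_n} |\hat\pi_{j(n)} - \pi_{j(n)}| \le \frac{c_1 \log n + c_2 \log p_n}{\sqrt{n}} \left( \max_{1 \le j \le p_n} \frac{|b_{j(n)} - a_{j(n)}|}{\sigma_{j(n)}} \right) + \frac{1}{2} \left( \max_{1 \le j \le p_n} (\hat\sigma_{j(n)} \vee \sigma_{j(n)}) \left| \frac{1}{\hat\sigma_{j(n)}} - \frac{1}{\sigma_{j(n)}} \right| \right),$$

where $x \vee y$ denotes the maximum of $x, y$ and

(6)    $$\pi_{j(n)} = 2\Phi\left( -\left| Z_{j(n)} + \frac{\sqrt{n}\,(\mu_{j(n)} - \mu_{0,j(n)})}{\sigma_{j(n)}} \right| \right).$$

In particular, if $n \to \infty$,
$\max_{1 \le j \le p_n} |\hat\sigma_{j(n)} - \sigma_{j(n)}| / (\hat\sigma_{j(n)} \wedge \sigma_{j(n)}) \to 0$
in probability, and

(7)    $$\frac{\log(n \vee p_n)}{\sqrt{n}} \times \max_{1 \le j \le p_n} \frac{|b_{j(n)} - a_{j(n)}|}{\sigma_{j(n)}} \to 0,$$

then the left-hand-side of (5) $\to 0$ in probability.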



Remark 3 When $|b_{j(n)} - a_{j(n)}| / \sigma_{j(n)}$ is bounded, condition (7)
becomes $\log^2 p_n / n = o(1)$.



Remark 4 Now, suppose the indices $J_n = \{1, \ldots, p_n\}$ are divided into two
groups, $J_{0n}$ and $J_{1n}$, where $H_0^{j(n)}$ holds for all $j \in J_{0n}$ and
where $\delta_{j(n)} = |\mu_{j(n)} - \mu_{0,j(n)}| / \sigma_{j(n)} \ge \tau$ for all
$j \in J_{1n}$, where $\tau > 0$. Then all of the $\hat\pi_{j(n)}$'s for
$j \in J_{0n}$ will simultaneously converge to uniform random variables with the
same dependency structure inherent in the data (as per the discussion before
theorem 2 above). Moreover, all of the $\hat\pi_{j(n)}$ for $j \in J_{1n}$ will
simultaneously converge to 0. Thus the q-value approach to controlling FDR given in
Storey, Taylor and Siegmund (2004) should work under their weak dependence
conditions (7)-(9) (see also their theorem 5). A minor adjustment to this argument
will also work for contiguous alternative hypotheses where the
$\sqrt{n}\,\delta_{j(n)}$ quantities converge to bounded constants.
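The normal-approximation p-values $2\Phi(-|T_{j(n)}|)$ of section 2.2 are simple to compute and to pass to an FDR procedure. The sketch below is illustrative only: the simulated data, the helper names `mean_test_pvalue` and `benjamini_hochberg`, and the use of the Benjamini and Hochberg (1995) step-up rule in place of the q-value method are assumptions made for the example. The sample standard deviation is used as the location-invariant estimator of $\sigma_{j(n)}$.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_test_pvalue(x, mu0=0.0):
    """Two-sided p-value 2*Phi(-|T|) with T = sqrt(n) * (sample mean - mu0) / sample sd."""
    n = len(x)
    t = math.sqrt(n) * (mean(x) - mu0) / stdev(x)
    return 2.0 * NormalDist().cdf(-abs(t))


def benjamini_hochberg(pvalues, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls below the BH line q * rank / m.
        if pvalues[j] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical data: p genes on n arrays, the first n_signal genes shifted away from zero.
rng = random.Random(1)
n, p, n_signal = 50, 500, 20
data = [[rng.uniform(-1, 1) + (1.0 if j < n_signal else 0.0) for _ in range(n)]
        for j in range(p)]
pvals = [mean_test_pvalue(x) for x in data]
rejected = benjamini_hochberg(pvals, q=0.05)
print(len(rejected), sum(j < n_signal for j in rejected))
```

The two printed counts are the number of rejected genes and how many of them are truly shifted, which gives a quick informal view of the realized false discoveries.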

2.3 Estimation of marginal sample medians. Now we consider inference for the
median. Assume that each $F_{j(n)}$ has median $\xi_{j(n)}$ and is continuous in a
neighborhood of $\xi_{j(n)}$ with density $f_{j(n)}$. In this section, we do not
require the support of $F_{j(n)}$ to be compact. We do, however, assume that there
exist $\eta, \tau > 0$ such that

(8)    $$\min_{1 \le j \le p_n} \inf_{x : |x - \xi_{j(n)}| \le \eta} f_{j(n)}(x) \ge \tau.$$

Denote the sample median for $X_{1j(n)}, \ldots, X_{nj(n)}$ as $\hat\xi_{j(n)}$.
More precisely, let $\hat\xi_{j(n)} = \inf\{x : \hat F_{j(n)}(x) \ge 1/2\}$. The
following corollary gives simultaneous consistency of $\hat\xi_{j(n)}$:

Corollary 3 Under condition (8) (for some $\eta, \tau > 0$) and the conditions of
corollary 1, we have that



(9) l^iW - ^j{n)\ = Op I + • 



Now assume that we wish to test the marginal null hypothesis
$H_0^{j(n)} : \xi_{j(n)} = \xi_{0,j(n)}$ with the test statistic
$U_{j(n)} = 2\sqrt{n}\,\hat f_{j(n)}(\hat\xi_{j(n)} - \xi_{0,j(n)})$, where
$\hat f_{j(n)}$ is a consistent estimator of $f_{j(n)}(\xi_{j(n)})$. As discussed
in Kosorok (1999), this is a good choice of median test because it converges
rapidly to its limiting Gaussian distribution and appears to have better moderate
sample size performance compared to other median tests. As with the marginal mean
test, we need consistent estimates of the p-values of these tests. We now study the
consistency of the p-value estimates $\hat\pi'_{j(n)} = 2\Phi(-|U_{j(n)}|)$. We
need some additional conditions. Assume there exist $\eta, \tau > 0$ and
$M < \infty$ such that (8) holds and, moreover, that

(10)    $$\max_{1 \le j \le p_n} \sup_{x : |x - \xi_{j(n)}| \le \eta} f_{j(n)}(x) \le M$$



and 



/iiN \fj{n)i^j(n) + u) - fj(n)iCj(n))\ . 

(11) max sup sup -^^^ — -jr. — < M. 



We now have the following corollary: 

Corollary 4 Under conditions (8), (10) and (11), for some $\eta, \tau > 0$ and
$M < \infty$, and provided both
$\max_{1 \le j \le p_n} |\hat f_{j(n)} - f_{j(n)}(\xi_{j(n)})| = o_P(1)$ and
$\log^3 p_n / n \to 0$ as $n \to \infty$, we have that

(12)    $$\max_{1 \le j \le p_n} |\hat\pi'_{j(n)} - \pi'_{j(n)}| = o_P(1),$$

where

(13)    $$\pi'_{j(n)} = 2\Phi\left( -\left| Z_{j(n)} + 2\sqrt{n}\, f_{j(n)}(\xi_{j(n)})(\xi_{j(n)} - \xi_{0,j(n)}) \right| \right),$$

and, for each $n \ge 1$, $Z_{1(n)}, \ldots, Z_{p_n(n)}$ are standard normals
conditionally independent given $\mathcal{F}_n$ and for which each $Z_{j(n)}$ has
conditional distribution given $\mathcal{F}_n$ depending only on
$\mathcal{F}_{j(n)}$, $1 \le j \le p_n$.

Now, for corollary 4 to be useful in conducting inference, we need simultaneously
consistent estimators $\hat f_{j(n)}$. One possibility is



where the window widths $\hat h_{j(n)}$ are allowed to depend on the data but must
satisfy $\max_{1 \le j \le p_n} \hat h_{j(n)} = o_P(1)$ and



r-i / lognVpn , log Pn I 
(15) max h., h A/ = op[l). 



If, in addition to the conditions of corollary ^ we assume conditions (jSJ and (|1U() 
apply to the lower and upper quartiles of the distributions -Fj(n)) then = 
2Ij(„)n~-'^/^, where is the sample interquartile range based on satisfies 
this requirement. This can be argued by first noting that Ij{n) is asymptotically 
simultaneously bounded above and below and that 



n V n"^'^ V n^/'^ 
There are many other possibilities that will also work.
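To make the median test of this section concrete, the following sketch computes $U_{j(n)}$ and the p-value $2\Phi(-|U_{j(n)}|)$ for a single gene. The density estimator used here, a difference quotient of the empirical distribution function over a window around the sample median, and the window width (interquartile range times $n^{-1/5}$) are assumptions made purely for illustration and are not claimed to match the estimator or window width specified in the text. The helper names are hypothetical.

```python
import bisect
import math
import random
from statistics import NormalDist


def ecdf(sorted_x, t):
    """Empirical distribution function of an already sorted sample, evaluated at t."""
    return bisect.bisect_right(sorted_x, t) / len(sorted_x)


def sample_median(sorted_x):
    """inf{x : F_hat(x) >= 1/2}, i.e. the ceil(n/2)-th order statistic."""
    return sorted_x[math.ceil(len(sorted_x) / 2) - 1]


def median_test_pvalue(x, xi0=0.0):
    """p-value 2*Phi(-|U|) with U = 2*sqrt(n)*f_hat*(median - xi0).

    Assumes continuous data and n large enough that the window width is positive.
    """
    xs = sorted(x)
    n = len(xs)
    xi_hat = sample_median(xs)
    # Assumed window width: sample interquartile range times n^(-1/5).
    iqr = xs[math.ceil(3 * n / 4) - 1] - xs[math.ceil(n / 4) - 1]
    h = iqr * n ** (-0.2)
    # Assumed density estimate: difference quotient of the empirical CDF at the median.
    f_hat = (ecdf(xs, xi_hat + h) - ecdf(xs, xi_hat - h)) / (2.0 * h)
    u = 2.0 * math.sqrt(n) * f_hat * (xi_hat - xi0)
    return 2.0 * NormalDist().cdf(-abs(u))


rng = random.Random(2)
null_gene = [rng.gauss(0.0, 1.0) for _ in range(50)]
shifted_gene = [rng.gauss(1.0, 1.0) for _ in range(50)]
print(median_test_pvalue(null_gene), median_test_pvalue(shifted_gene))
```

Any other simultaneously consistent density estimator could be substituted for `f_hat` without changing the rest of the computation.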



3. Marginal asymptotics for two-sample comparisons. The results of
section 2 can be extended to two sample results, where we have two i.i.d. samples
of vectors of length $p_n$, where $n = n_1 + n_2$, and where $n_k$ is the size of
sample $k$, for $k = 1, 2$. Consistency results for estimating marginal
distribution functions, marginal means and marginal medians follow essentially
without modification from theorem 1 and corollaries 1 and 3. Our interest will
therefore focus on the more challenging issue of testing whether the marginal means
or medians are the same between the two samples. We use superscript $(k)$ to denote
membership in group $k$, for $k = 1, 2$. In particular,
$X^{(k)}_{i(n)} = (X^{(k)}_{i1(n)}, \ldots, X^{(k)}_{ip_n(n)})'$ is the ith
observed vector in the kth group. In a similar manner, $F^{(k)}_{j(n)}$,
$\hat F^{(k)}_{j(n)}$, $a^{(k)}_{j(n)}$, $b^{(k)}_{j(n)}$, $\mu^{(k)}_{j(n)}$,
$\sigma^{(k)}_{j(n)}$, $\xi^{(k)}_{j(n)}$, $f^{(k)}_{j(n)}$ and
$\mathcal{F}^{(k)}_{j(n)}$, for $1 \le j \le p_n$, $k = 1, 2$, and all $n \ge 1$,
are the two-sample versions of the corresponding one-sample quantities introduced
in section 2. Also let
$\mathcal{F}^*_{j(n)} = \sigma(\mathcal{F}^{(1)}_{j(n)}, \mathcal{F}^{(2)}_{j(n)})$
and $\mathcal{F}^*_n = \sigma(\mathcal{F}^*_{1(n)}, \ldots, \mathcal{F}^*_{p_n(n)})$.

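We first consider comparing the marginal means. Let $\bar X^{(k)}_{j(n)}$ be the
sample mean of $X^{(k)}_{1j(n)}, \ldots, X^{(k)}_{n_k j(n)}$, and assume that we
wish to test the marginal null hypothesis
$H_0^{j(n)} : \mu^{(1)}_{j(n)} = \mu^{(2)}_{j(n)}$ with the test statistic

$$T^*_{j(n)} = \sqrt{\frac{n_1 n_2}{n_1 [\hat\sigma^{(2)}_{j(n)}]^2 + n_2 [\hat\sigma^{(1)}_{j(n)}]^2}} \left( \bar X^{(1)}_{j(n)} - \bar X^{(2)}_{j(n)} \right),$$

where $\hat\sigma^{(k)}_{j(n)}$ is a location-invariant and consistent estimator of
$\sigma^{(k)}_{j(n)}$, $k = 1, 2$. The following corollary provides conditions under
which p-values estimated by $\hat\pi^*_{j(n)} = 2\Phi(-|T^*_{j(n)}|)$ are uniformly
consistent over all $1 \le j \le p_n$: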



Corollary 5 Let the constants $c_1, c_2$ be as in theorem 2. Then for all
$n_1, n_2, p_n \ge 2$, there exist standard normal random variables
$Z^*_{1(n)}, \ldots, Z^*_{p_n(n)}$ which are conditionally independent given
$\mathcal{F}^*_n$ and for which each $Z^*_{j(n)}$ has conditional distribution
given $\mathcal{F}^*_n$ depending only on $\mathcal{F}^*_{j(n)}$,
$1 \le j \le p_n$, such that



(16) max 



< 



E 

fc=l,2 



|,(fe) _ (k) I 

Ci log nfc + C2 log p„ / I (n) "^j (n) I 

max - 



„ (A: 
Jm ]{n) 



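where

(17)    $$\pi^*_{j(n)} = 2\Phi\left( -\left| Z^*_{j(n)} + \sqrt{\frac{n_1 n_2}{n_1 [\sigma^{(2)}_{j(n)}]^2 + n_2 [\sigma^{(1)}_{j(n)}]^2}} \left( \mu^{(1)}_{j(n)} - \mu^{(2)}_{j(n)} \right) \right| \right).$$

In particular, if $n_k \to \infty$,
$\max_{1 \le j \le p_n} |\hat\sigma^{(k)}_{j(n)} - \sigma^{(k)}_{j(n)}| / (\hat\sigma^{(k)}_{j(n)} \wedge \sigma^{(k)}_{j(n)}) \to 0$
in probability, and

(18)    $$\frac{\log(n_k \vee p_n)}{\sqrt{n_k}} \times \max_{1 \le j \le p_n} \frac{|b^{(k)}_{j(n)} - a^{(k)}_{j(n)}|}{\sigma^{(k)}_{j(n)}} \to 0,$$

for $k = 1, 2$, then the left-hand-side of (16) $\to 0$ in probability.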



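We now consider comparing marginal medians. Assume that we wish to test the
marginal null hypothesis $H_0^{j(n)} : \xi^{(1)}_{j(n)} = \xi^{(2)}_{j(n)}$ with the
test statistic

$$U^*_{j(n)} = 2 \sqrt{\frac{n_1 n_2}{n_1 / [\hat f^{(2)}_{j(n)}]^2 + n_2 / [\hat f^{(1)}_{j(n)}]^2}} \left( \hat\xi^{(1)}_{j(n)} - \hat\xi^{(2)}_{j(n)} \right),$$

where $\hat f^{(k)}_{j(n)}$ is consistent for
$f^{(k)}_{j(n)}(\xi^{(k)}_{j(n)})$, $k = 1, 2$. The following corollary provides
conditions under which p-values estimated by
$\hat\pi'^{*}_{j(n)} = 2\Phi(-|U^*_{j(n)}|)$ are uniformly consistent over all
$1 \le j \le p_n$.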
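Corollary 6 Assume that the one-sample conditions given in expressions (8),
(10) and (11), for all of the marginal distribution functions and densities in both
samples, are satisfied for constants $\eta, \tau > 0$ and $0 < M < \infty$. Assume
also that
$\max_{k = 1, 2;\, 1 \le j \le p_n} |\hat f^{(k)}_{j(n)} - f^{(k)}_{j(n)}(\xi^{(k)}_{j(n)})| = o_P(1)$
and $\log^3 p_n / (n_1 \wedge n_2) \to 0$ as $n \to \infty$. Then

(19)    $$\max_{1 \le j \le p_n} |\hat\pi'^{*}_{j(n)} - \pi'^{*}_{j(n)}| = o_P(1),$$

where

(20)    $$\pi'^{*}_{j(n)} = 2\Phi\left( -\left| Z^*_{j(n)} + 2 \sqrt{\frac{n_1 n_2}{n_1 / [f^{(2)}_{j(n)}(\xi^{(2)}_{j(n)})]^2 + n_2 / [f^{(1)}_{j(n)}(\xi^{(1)}_{j(n)})]^2}} \left( \xi^{(1)}_{j(n)} - \xi^{(2)}_{j(n)} \right) \right| \right),$$

and, for each $n \ge 1$, $Z^*_{1(n)}, \ldots, Z^*_{p_n(n)}$ are standard normals
conditionally independent given $\mathcal{F}^*_n$ and for which each $Z^*_{j(n)}$
has conditional distribution given $\mathcal{F}^*_n$ depending only on
$\mathcal{F}^*_{j(n)}$, $1 \le j \le p_n$.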



4. Distribution free statistics. When the distribution of the test statistic
under the null hypothesis does not depend on the distribution function, results
stronger than those presented in sections 2 and 3 are possible for marginal p-value
consistency. Consider first the one-sample setting, and assume that the
distributions $F_{j(n)}$ are all continuous and symmetric around their respective
medians. Suppose we are interested in marginal testing of
$H_0^{j(n)} : \xi_{j(n)} = 0$ using the signed rank test $\hat T_{j(n)}$ studied in
section 3 of Kosorok and Ma (2005). Define

$$\hat V_{j(n)} = \frac{\hat T_{j(n)} - (n^2 + n)/4}{\sqrt{(2n^3 + 3n^2 + n)/24}}.$$

Note that the distribution of $\hat V_{j(n)}$ does not depend on $F_{j(n)}$ under
$H_0^{j(n)}$. Let $\Phi_n$ be the exact distribution of $\hat V_{j(n)}$ under
$H_0^{j(n)}$. It is easy to verify that $\Phi_n$ converges uniformly to $\Phi$.
Hence

$$\max_{1 \le j \le p_n} \left| 2\Phi_n(-|\hat V_{j(n)}|) - 2\Phi(-|\hat V_{j(n)}|) \right| \to 0,$$

regardless of how fast $p_n$ grows. Thus the normal approximation is simultaneously
consistent for the true p-values when $n \to \infty$, without any constraints on
$p_n$.

The key feature that makes this work is that the p-values depend only on the
correctness of the probability calculation under the null hypothesis. P-value
computations do not require knowledge of the distribution under alternatives. The
only possibly unnatural assumption required for the above signed-rank test is
symmetry about the median. An alternative statistic is the sign test. Under the
null hypothesis that the median is zero, each sign is Bernoulli with probability
1/2, so the number of positive observations is binomial with parameters $n$ and
1/2. As with the signed-rank test, the standardized sign test under the null
converges to a normal limit. A disadvantage of the sign test is that the range of
possible values is limited, resulting in a granular distribution which converges
somewhat slowly to the normal limit.
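Both one-sample distribution-free tests can be computed in a few lines. The sketch below is hedged: it assumes the usual form of the signed rank statistic (the sum of the ranks of $|X_{ij(n)}|$ over the positive observations, with no ties), which has null mean $(n^2 + n)/4$ and null variance $(2n^3 + 3n^2 + n)/24$; the definition in Kosorok and Ma (2005) may be normalized differently. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def signed_rank_statistic(x):
    """Sum of the ranks of |x_i| taken over the positive observations (no ties assumed)."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if x[i] > 0)


def signed_rank_pvalue(x):
    """Normal-approximation p-value 2*Phi(-|V|) for the standardized signed rank statistic."""
    n = len(x)
    t = signed_rank_statistic(x)
    v = (t - (n * n + n) / 4) / math.sqrt((2 * n**3 + 3 * n**2 + n) / 24)
    return 2.0 * NormalDist().cdf(-abs(v))


def sign_test_pvalue(x):
    """Normal-approximation p-value for the sign test of a zero median."""
    n = len(x)
    positives = sum(1 for xi in x if xi > 0)
    # Under the null the count of positive signs is binomial(n, 1/2).
    z = (positives - n / 2) / math.sqrt(n / 4)
    return 2.0 * NormalDist().cdf(-abs(z))


example = [0.8, -0.3, 1.4, 2.1, -0.6, 0.9, 1.7, 0.2, -0.1, 1.1]
print(signed_rank_pvalue(example), sign_test_pvalue(example))
```

Because neither null distribution depends on the underlying $F_{j(n)}$, the same two functions can be applied gene by gene without any gene-specific tuning.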

Similar reasoning applies to distribution-free two-sample test statistics.
Interestingly, there appears to be a larger variety of useful tests to choose from
which do not require specification of the distribution function than there are in
the one-sample setting. Suppose we are interested in marginal testing of
$H_0^{j(n)} : F^{(1)}_{j(n)} = F^{(2)}_{j(n)}$, and we assume that the
$F^{(k)}_{j(n)}$ are continuous for all $1 \le j \le p_n$ and $k = 1, 2$. Let
$\hat F^{(k)}_{j(n)}(t) = n_k^{-1} \sum_{i=1}^{n_k} 1\{X^{(k)}_{ij(n)} \le t\}$,
for $k = 1, 2$;
$\hat F^{(0)}_{j(n)} = n^{-1}[n_1 \hat F^{(1)}_{j(n)} + n_2 \hat F^{(2)}_{j(n)}]$;
and $\hat G_{j(n)} = \sqrt{n_1 n_2 / n}\,[\hat F^{(1)}_{j(n)} - \hat F^{(2)}_{j(n)}]$.
We now consider several statistics which are invariant under monotone
transformations of the data:

1. The two-sample Wilcoxon rank sum test $\hat T^{*1}_{j(n)} = \sqrt{12} \int_{\mathbb{R}} \hat G_{j(n)}(s)\, d\hat F^{(0)}_{j(n)}(s)$;

2. The two-sample Kolmogorov-Smirnov test $\hat T^{*2}_{j(n)} = \sup_{t \in \mathbb{R}} |\hat G_{j(n)}(t)|$;

3. The two-sample Cramer-von Mises test $\hat T^{*3}_{j(n)} = \int_{\mathbb{R}} \hat G^2_{j(n)}(s)\, d\hat F^{(0)}_{j(n)}(s)$.

Fix $j \in \{1, \ldots, p_n\}$ and assume $H_0^{j(n)}$ holds. All three of these
statistics are now invariant under the monotone transformation
$t \mapsto F^{(0)}_{j(n)}(t)$, where
$F^{(0)}_{j(n)} \equiv F^{(1)}_{j(n)} = F^{(2)}_{j(n)}$. Thus, without loss of
generality, we can assume the data are i.i.d. uniform[0,1]. For $m = 1, 2, 3$, let
$K^{*m}_n$ be the corresponding cumulative distribution function for the statistic
$\hat T^{*m}_{j(n)}$ under this uniformity assumption (note that it does not depend
on $j$ because of the invariance), and let $K^{*m}_0$ be the limiting cumulative
distribution function. Suppose that we compute approximate p-values for the three
statistics as follows: $\hat\pi^{*1}_{j(n)} = 2K^{*1}_0(-|\hat T^{*1}_{j(n)}|)$ and
$\hat\pi^{*m}_{j(n)} = 1 - K^{*m}_0(\hat T^{*m}_{j(n)})$ for $m = 2, 3$. Because it
can be shown that $K^{*m}_0$ is continuous for all $m = 1, 2, 3$, the convergence of
$K^{*m}_n$ to $K^{*m}_0$ is uniform. Thus, even after we drop the $H_0^{j(n)}$
assumption, the approximate p-values based on $K^{*m}_0$ are simultaneously
consistent for the true p-values, as $n \to \infty$, without constraining $p_n$.

The following lemma yields the form of $K^{*m}_0$, for $m = 1, 2, 3$. The results
are essentially classical, but they are included here for completeness:

Lemma 1 For $m = 1, 2, 3$, $K^{*m}_n$ converges uniformly to $K^{*m}_0$, as
$n_1 \wedge n_2 \to \infty$, where

• $K^{*1}_0 = \Phi$;

• For $t > 0$, $K^{*2}_0(t) = 1 - 2 \sum_{l=1}^{\infty} (-1)^{l-1} e^{-2 l^2 t^2}$ is
the distribution of the supremum in absolute value of a standard Brownian bridge;

• $K^{*3}_0$ is the distribution of $\pi^{-2} \sum_{l=1}^{\infty} l^{-2} Z_l^2$,
where $Z_1, Z_2, \ldots$ are i.i.d. standard normals.


5. Impact of microarray normalization. In this section, we consider the
effect of normalization on the theory presented in sections 2-4. For the simple
normalization model (1), this will require the $\hat h_i$'s to be uniformly
consistent with error of order $O_P(1/(\sqrt{n} \log n))$. This requirement seems
reasonable for certain estimation methods, including the method described in FPH.
In this method, data across all genes within each array are utilized for estimating
the $h_i$'s. Since the number of genes $p_n$ usually increases nearly exponentially
relative to the number of microarrays, the number of observations available for
estimating the $h_i$'s is many orders of magnitude higher than $n$, even after
taking into account dependencies within arrays and the fact that the number of
arrays is increasing in $n$. For this particular facet of our problem, the large
number of genes actually works in our favor. A variant of this argument can also be
found in Kosorok and Ma (2005).

Consider first the one-sample setting of section 2. Let
$\tilde X_{i(n)} = (\tilde X_{i1(n)}, \ldots, \tilde X_{ip_n(n)})'$ be an
approximation of the "true data" $X_{i(n)}$, $1 \le i \le n$, and define

$$\hat\epsilon_n = \max_{1 \le j \le p_n;\, 1 \le i \le n} |\tilde X_{ij(n)} - X_{ij(n)}|.$$

With proper, partially consistent normalization, the true gene effects
$\{X_{ij(n)}, 1 \le j \le p_n, 1 \le i \le n\}$ should be uniformly consistently
estimated by the residuals from the normalization
$\{\tilde X_{ij(n)}, 1 \le j \le p_n, 1 \le i \le n\}$. In other words,
$\hat\epsilon_n = o_P(1)$. The essence of our arguments involves an assessment of
how well $\tilde F_{j(n)}(t) = n^{-1} \sum_{i=1}^{n} 1\{\tilde X_{ij(n)} \le t\}$
approximates $\hat F_{j(n)}(t)$ uniformly in $t$. We need the following
strengthening of condition (10):

(21)    $$\limsup_{n \to \infty} \max_{1 \le j \le p_n} \sup_{t} f_{j(n)}(t) \le M,$$

for some $M < \infty$. We now have the following theorem, the proof of which
involves a precise bound on the modulus of continuity of Brownian motion (see lemma
2 in section 8 below):

Theorem 3 Assume condition (21) holds for some $M < \infty$. Then the following
are true:

(i) If $\log p_n / n = o(1)$ and $\hat\epsilon_n = o_P(1)$, then
$$\max_{1 \le j \le p_n} \|\tilde F_{j(n)} - \hat F_{j(n)}\|_{\infty} = o_P(1);$$

(ii) If, in addition, $\log^2 p_n / n = o(1)$ and
$\sqrt{n}(\log n)\hat\epsilon_n = O_P(1)$, then also
$$\max_{1 \le j \le p_n} \|\tilde F_{j(n)} - \hat F_{j(n)}\|_{\infty} = o_P(n^{-1/2}).$$
Remark 5 Note that the one-sample signed rank test $\hat T_{j(n)}$ can be written
as a normalization of
$\sqrt{n} \int [\hat F_{j(n)}(u) - \hat F_{j(n)}(-u)]\, d\hat F_{j(n)}(u)$, and the
one-sample sign test can be written as a normalization of
$\sqrt{n} \int \mathrm{sign}(u)\, d\hat F_{j(n)}(u)$. Thus part (ii) of theorem 3
allows us to replace $\hat F_{j(n)}$ with $\tilde F_{j(n)}$ in both of these
statistics without destroying the simultaneous consistency over $1 \le j \le p_n$
established in section 4 of the normal approximation for the true p-values based on
the true data.
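The quantity controlled in this section, the largest sup-norm distance between the empirical distribution functions of the approximate and true data, is easy to examine by simulation. The following sketch is only illustrative: it assumes unif[-1,1] true data, independent genes, and a bounded uniform perturbation standing in for normalization error, none of which comes from the text. The names `ecdf_distance` and `max_ecdf_distance` are hypothetical.

```python
import bisect
import random


def ecdf_distance(x, y):
    """Sup-norm distance between the empirical distribution functions of x and y."""
    xs, ys = sorted(x), sorted(y)
    # The difference of two right-continuous step functions only changes at data points.
    return max(abs(bisect.bisect_right(xs, t) / len(xs) - bisect.bisect_right(ys, t) / len(ys))
               for t in xs + ys)


def max_ecdf_distance(n, p, eps, rng):
    """Largest distance over p genes when every observation is perturbed by at most eps."""
    worst = 0.0
    for _ in range(p):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        # Hypothetical normalization error, bounded by eps in absolute value.
        x_tilde = [xi + rng.uniform(-eps, eps) for xi in x]
        worst = max(worst, ecdf_distance(x, x_tilde))
    return worst


rng = random.Random(3)
for eps in [0.5, 0.05, 0.005, 0.0005]:
    print(eps, max_ecdf_distance(50, 200, eps, rng))
```

The printed distances give an informal picture of how the approximation error in the empirical distribution functions behaves as the perturbation bound shrinks.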

Theorem 3 can also be used to verify that the asymptotic results for the
one-sample mean and median tests of sections 2.2 and 2.3 can be similarly extended
for the approximate data $\tilde X_{1(n)}, \ldots, \tilde X_{n(n)}$. For
$j = 1, \ldots, p_n$, let $\tilde{\bar X}_{j(n)}$ be the sample mean of
$\tilde X_{1j(n)}, \ldots, \tilde X_{nj(n)}$, and define the approximate sample
median $\tilde\xi_{j(n)} = \inf\{r : \tilde F_{j(n)}(r) \ge 1/2\}$. The following
corollary yields consistency of these estimators:

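Corollary 7 Assume the conditions of theorem 3, part (i), hold. Then

(i) $\max_{1 \le j \le p_n} \|\tilde F_{j(n)} - F_{j(n)}\|_{\infty} = o_P(1)$;

(ii) Provided $\limsup_{n \to \infty} \max_{1 \le j \le p_n} |b_{j(n)} - a_{j(n)}| < \infty$,
$\max_{1 \le j \le p_n} |\tilde{\bar X}_{j(n)} - \mu_{j(n)}| = o_P(1)$;

(iii) $\max_{1 \le j \le p_n} |\tilde\xi_{j(n)} - \xi_{j(n)}| = o_P(1)$.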

The following corollary strengthens result (ii) of corollary 7 above and yields
consistency of the p-values of one-sample tests based on the approximate data:

Corollary 8 Assume the conditions of theorem 3, part (ii), hold. Then the
following results are true under the given conditions:

(i) Provided
$\limsup_{n \to \infty} \max_{1 \le j \le p_n} n^{-1/4} |b_{j(n)} - a_{j(n)}| < \infty$,
$\max_{1 \le j \le p_n} |\tilde{\bar X}_{j(n)} - \mu_{j(n)}| = o_P(1)$.

(ii) Suppose that the conditions of corollary 2 hold, except that
$\tilde{\bar X}_{j(n)}$ is used instead of $\bar X_{j(n)}$ and that all other
estimated quantities are based on $\tilde F_{j(n)}$ rather than on
$\hat F_{j(n)}$, for $j = 1, \ldots, p_n$. Then, provided

$$\limsup_{n \to \infty} \max_{1 \le j \le p_n} \frac{|b_{j(n)} - a_{j(n)}|}{\sigma_{j(n)}} < \infty$$

and
$\max_{1 \le j \le p_n} |\tilde\sigma_{j(n)} - \sigma_{j(n)}| / (\tilde\sigma_{j(n)} \wedge \sigma_{j(n)}) \to 0$
in probability, $\max_{1 \le j \le p_n} |\tilde\pi_{j(n)} - \pi_{j(n)}| = o_P(1)$,
for the filtrations $\mathcal{F}_n$ and $\mathcal{F}_{j(n)}$, $1 \le j \le p_n$,
based on the true data.

(iii) Suppose that the conditions of corollary 4 hold, except that
$\tilde\xi_{j(n)}$ is used instead of $\hat\xi_{j(n)}$ and that all other estimated
quantities are based on $\tilde F_{j(n)}$ rather than on $\hat F_{j(n)}$, for
$j = 1, \ldots, p_n$. Then the conclusions of corollary 4 still hold for the
filtrations $\mathcal{F}_n$ and $\mathcal{F}_{j(n)}$, $1 \le j \le p_n$, based on
the true data.

Remark 6 Parts (ii) and (iii) of corollary 8 tell us that we can construct valid
mean and median based hypothesis tests from suitably normalized data, and that any
dependencies beyond the original dependency structure induced by the approximation
vanish asymptotically. Thus the arguments given in remark 4 regarding the validity
of the q-value approach for controlling FDR still hold after normalization.

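The extension of these results to the two-sample setting is straightforward. As
done in section 4, we will use superscript $(k)$ to denote membership in group $k$,
for $k = 1, 2$. Let $\tilde F^{(k)}_{j(n)}$ be the empirical distribution of the
approximate data sample
$\tilde X^{(k)}_{1j(n)}, \ldots, \tilde X^{(k)}_{n_k j(n)}$; let
$\hat\epsilon^{(k)}_n = \max_{1 \le j \le p_n;\, 1 \le i \le n_k} |\tilde X^{(k)}_{ij(n)} - X^{(k)}_{ij(n)}|$
be the maximum error between the approximate and true data for group $k$; and
redefine $\hat\epsilon_n = \hat\epsilon^{(1)}_n \vee \hat\epsilon^{(2)}_n$. Also let
$\tilde T^{*m}_{j(n)}$ be the version of $\hat T^{*m}_{j(n)}$ with
$\tilde F^{(k)}_{j(n)}$ replacing $\hat F^{(k)}_{j(n)}$, for $k = 1, 2$ and
$m = 1, 2, 3$. The following corollary gives the main two-sample approximation
results: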

Corollary 9 Assume niAn2 oo; limsup„_,oc max^^i 2 maxi<j<p„ 1 1 1 1 00 < 
M, for some M < 00; log^p„/(ni A 712) = o(l); and ^/n{\ogn)en = Op{l). Then 
the following are true under the given conditions: 

(i) Suppose that the conditions of corollary \^ hold, except the sample means are 
based on the approximate data and all other estimated quantities are based on 
Fj^^-^ rather than on F^^y for j = 1, . . . ,pn and k = 1, 2. Then, provided 



/iaW I vll 
iim sup max max ttt < 00 

n^oo fc=l,2 1<j<p„ 

i(") 
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, .(fc) (fc) , / (fc) .(fc) \ 

and maxfc=i,2maxi<,<p„ a)^l^^ - a)^^^^ / (f^^^^^a^^^) j - 

^l(n)-^l(n) =0P(1), 



max 

/or i/ie filtrations J-n and J'j{-n); 1 < i < Pn; ^asec? on the true data, 
(a) Suppose that the conditions of corollary\^hold, except that C]^^) ^-^ used instead 

"(k) ~ (k) 

of and that all other estimated quantities are based on F^(^^-^ rather than 
on Fj^y for j = 1, . . . ,pn and k = 1,2. Then the conclusions of corollary \^ 
still hold for the filtrations J-n and J'j[n)! 1 ^ i ^ Pru based on the true data. 

(iii) $\max_{1 \le j \le p_n} |\tilde{\pi}^{*m}_{j(n)} - \hat{\pi}^{*m}_{j(n)}| = o_P(1)$, for m = 1, 2, 3. Thus the approximate
p-values based on the approximate data for the three distribution-free two-sample
tests given in section 4 are uniformly consistent for the true p-values based on
the true data.



6. Numerical studies. 

6.1 One-sample simulation study. We used a small simulation study to assess 
the finite sample performance of the following one-sample methodologies: (1) the 
mean based comparison of section 2.2, (2) the median based comparison of section 
2.3 and (3) the signed rank test of section 4. We set the number of genes to p = 2000
and the number of arrays to n = 20, 50. Let $Z_{i1}, Z_{i2}, \ldots$, $i = 1, \ldots, n$, be sequences
of i.i.d. standard normal random variables. We generated simulated data using the
following three models:

Model 1: $X_{ij} = H(Z_{ij})$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$;

Model 2: Xij = H (EE^li)™m+i ^'/^) ^i^^ A; = 10,m = 7; 
Model 3: Same as Model 2, but with k = 10, m = 3.

In the above, $H = 2\Phi - 1$, where $\Phi$ is the cumulative distribution function of the standard
normal. This yields a marginal unif[-1, 1] distribution for all three models. The
genes in model 1 are i.i.d., while in model 2 there is strong dependence and in
model 3 weak dependence between genes. We assume the first 40 genes have non-zero
means, denoted $\beta$ and generated from unif[-2, 2]. For each approach, marginal
p-values are computed based on the asymptotic results for one-sample tests given
in sections 2 and 4. For the median approach, density estimation is based on the
interquartile range bandwidth kernel described in the last paragraph of section 2.3.
We employ standard FDR techniques with expected FDR E(FDR) = 0.2. The
marginal p-values are ranked, resulting in the ordered p-values $\pi_{(1)} \le \pi_{(2)} \le \cdots \le \pi_{(p)}$.
Let $\hat{g}$ be the largest $g$ such that $\pi_{(g)} \le qg/p$, where $q$ is the target FDR
(for the simulations, q = 0.2). Genes corresponding to $\pi_{(1)}, \ldots, \pi_{(\hat{g})}$ are identified as
significantly differentially expressed.
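The step-up selection rule just described can be sketched in a few lines. This is only an illustration under our reading of the rule (the comparison $\pi_{(g)} \le qg/p$); the function name `bh_reject` and the toy p-values are hypothetical and are not taken from the simulations.

```python
def bh_reject(pvalues, q):
    """Return sorted indices of the genes selected by the step-up rule.

    g_hat is the largest rank g with p_(g) <= q * g / p; all genes whose
    p-values have rank at most g_hat are selected.
    """
    p = len(pvalues)
    order = sorted(range(p), key=lambda j: pvalues[j])  # ranks by p-value
    g_hat = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= q * rank / p:
            g_hat = rank  # keep the largest rank satisfying the bound
    return sorted(order[:g_hat])


# Toy example (made-up p-values, target FDR q = 0.2 as in the simulations).
pvals = [0.001, 0.8, 0.03, 0.5, 0.012, 0.9, 0.04, 0.6, 0.7, 0.95]
selected = bh_reject(pvals, q=0.2)
print(selected)
```

Note that the rule is "step-up": a p-value exceeding its own threshold can still be selected if a larger ordered p-value falls below the threshold at its rank.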

Simulation results based on 100 replicates per scenario are shown in Table 1. We 
can see that as the sample size increases, the performances of all three approaches 
generally improve. When the sample size is small, the mean based approach can 
effectively identify differentially expressed genes, but with high false positive rates. 
Empirical FDRs for the rank approach are quite low. The rank based approach 
misses quite a few true positives. When the sample size is large, the median approach 
and the rank approach perform much better than the mean based approach, with
fewer false positives while still being able to identify true positives. The presence of
correlation appears to have very little impact on the performance.
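For readers who wish to reproduce the flavor of this experiment, the following sketch generates data from Model 1 and computes mean based marginal p-values of the form $2\Phi(-|T|)$ with $T = \sqrt{n}(\bar{X} - \mu_0)/\hat{\sigma}$. Several details are assumptions made for illustration: the shift $\beta_j$ is added to every observation of gene j, the sample standard deviation uses divisor n - 1, and the function names are hypothetical. The dimensions are kept small so that the example runs quickly.

```python
import math
import random
from statistics import NormalDist, mean, stdev

PHI = NormalDist()  # standard normal, used for H = 2*Phi - 1 and for p-values


def simulate_model1(n, p, n_signal, rng):
    """Model 1: X_ij = H(Z_ij) + beta_j with Z_ij i.i.d. N(0, 1).

    Assumption: beta_j ~ unif[-2, 2] for the first n_signal genes, 0 otherwise.
    """
    beta = [rng.uniform(-2, 2) for _ in range(n_signal)] + [0.0] * (p - n_signal)
    data = [[2 * PHI.cdf(rng.gauss(0, 1)) - 1 + beta[j] for j in range(p)]
            for _ in range(n)]
    return data, beta


def mean_pvalues(data, mu0=0.0):
    """Two-sided normal-approximation p-value for each gene (column)."""
    n = len(data)
    pvals = []
    for j in range(len(data[0])):
        col = [row[j] for row in data]
        t_stat = math.sqrt(n) * (mean(col) - mu0) / stdev(col)
        pvals.append(2 * PHI.cdf(-abs(t_stat)))
    return pvals


rng = random.Random(2024)
data, beta = simulate_model1(n=20, p=200, n_signal=40, rng=rng)
pvals = mean_pvalues(data)
print(sum(1 for v in pvals if v < 0.05), "genes with marginal p-value below 0.05")
```

The resulting p-values can then be passed to any FDR procedure; the counts will of course differ from those in Table 1, which uses p = 2000 and 100 replicates.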

6.2 Two-sample simulation study. Since the effect of dependence between genes
in the simulation study of section 6.1 was minimal, we decided to restrict our focus
to the i.i.d. gene setting for the two-sample simulations. We set the number of
genes to p = 2000 and numbers of arrays (sample sizes) to $n_1 = n_2 = 10, 30, 60$.
The model we explore is Model 4: $X^{(k)}_{ij} \sim$ unif[-1, 1], $i = 1, \ldots, n_k$, $j = 1, \ldots, p$,
and k = 1, 2. For this data, we apply the mean approach, the median approach, the
Wilcoxon test and the Kolmogorov-Smirnov test to the two-sample comparison of
$X^{(1)}_{ij} + \beta$ versus $X^{(2)}_{ij}$, where $\beta$ is generated as in section 6.1 for the first 40 genes
of each array. Summary statistics for E(FDR) = 0.2 and 100 replicates are shown
in Table 2. Conclusions similar to those of section 6.1 on the effects of sample size and
gene distribution can be made. We especially notice that when the sample size is
small, the mean based approach appears to be the only one that can identify a
significant number of true positives. The false positive rates are smaller than the
target for the median, Wilcoxon and Kolmogorov-Smirnov (KS) approaches. The
mismatch between the empirical FDR and the target FDR can be serious for the
mean approach, especially when the sample size is small.

Based on other numerical studies (not presented), it appears that part of the
convergence difficulties with the nonparametric approaches (in both the one- and
two-sample settings) is due to the small number of distinct possible values these
statistics can have. It is unclear how to solve this problem for the nonparametric
one-sample tests, but it appears that the two-sample tests can be improved by replacing
$\hat{G}_{j(n)}$ with

\[ \tilde{G}_{j(n)} = \sqrt{n_1 n_2 / n}\,\bigl(\tilde{F}^{(1)}_{j(n)} - \hat{F}^{(2)}_{j(n)}\bigr), \quad \text{where } \tilde{F}^{(1)}_{j(n)} = \frac{n_2}{n_2 + 1}\,\hat{F}^{(1)}_{j(n)}. \]

This increases the number of possible values of the statistic, and preliminary simulation
studies (also not presented) indicate that the rate of convergence for smaller sample
sizes is improved. Thus we recommend that this modification be considered
whenever $n_1 = n_2$. Note that the modification does not affect the asymptotics since

\[ \sqrt{\frac{n_1 n_2}{n}}\,\left|\frac{n_2}{n_2 + 1} - 1\right| \le \frac{1}{\sqrt{n_2}}. \]
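The effect of the modification on the number of attainable values can be illustrated numerically. The sketch below counts the distinct values taken by $\sup_t |cF_1(t) - F_2(t)|$ over repeated null samples, once with c = 1 and once with c = n2/(n2 + 1). Applying the factor to the first-sample empirical distribution is our reading of the text, the common scaling $\sqrt{n_1 n_2/n}$ is omitted because it does not change the count, and all function names are hypothetical.

```python
import random


def sup_ecdf_difference(x, y, scale_first=1.0):
    """sup_t |scale_first * F1(t) - F2(t)| for the two empirical distributions.

    Both functions are step functions, so it suffices to evaluate the
    difference at the pooled observation points.
    """
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    best = 0.0
    for t in sorted(xs + ys):
        while i < n1 and xs[i] <= t:
            i += 1
        while j < n2 and ys[j] <= t:
            j += 1
        best = max(best, abs(scale_first * i / n1 - j / n2))
    return best


def count_distinct_values(n1, n2, reps, scale_first, seed):
    """Number of distinct statistic values seen over `reps` null samples."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(reps):
        x = [rng.uniform(-1, 1) for _ in range(n1)]
        y = [rng.uniform(-1, 1) for _ in range(n2)]
        seen.add(round(sup_ecdf_difference(x, y, scale_first), 10))
    return len(seen)


n1 = n2 = 10
plain = count_distinct_values(n1, n2, reps=500, scale_first=1.0, seed=1)
modified = count_distinct_values(n1, n2, reps=500, scale_first=n2 / (n2 + 1), seed=1)
print(plain, modified)
```

With equal sample sizes the unmodified statistic can only take multiples of 1/n1, whereas the modified version takes values on a much finer grid.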



6.3 Estrogen data. These datasets were first presented by West et al. (2001) 
and Spang et al. (2001). Their common expression matrix monitors 7129 genes 
in 49 breast tumor samples. The data were obtained by applying the Affymetrix 
gene chip technology. The response describes the lymph nodal (LN) status, which 
is an indicator for the metastatic spread of the tumor, an important risk factor 
for disease outcome. 25 samples are positive (LN+) and 24 samples are negative 
(LN-). The goal is to identify genes differentially expressed between positive and 
negative samples from the 3332 genes passing the first step of processing described 
in Dudoit, Fridlyand and Speed (2002). A base 2 logarithmic transformation of the 
gene expressions is first applied. 

We set the target FDR to 0.1 and apply the standard FDR method with the
four two-sample comparison approaches: 445 (mean), 261 (median), 423 (Wilcoxon)
and 211 (KS) genes are identified, respectively. The mean based approach and the
Wilcoxon test identify substantially more genes than the median approach and the KS
test. This pattern was also seen in Table 2 (for sample size $n_1 = n_2 = 30$).
It is unclear what causes these differences. However, the overlaps of genes identified
by the different approaches are substantial. For example, there are 196 common
genes between the mean approach and the median approach. In Figure 1 we show
scatter plots of p-values from the different approaches. The rank correlation
coefficients show substantial similarities among the different approaches. Note the banded
pattern in the plots involving the KS statistic. This is a consequence of the low
number of distinct possible values this statistic can have, as was discussed in section 6.2
above.
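As an aside, a rank correlation between two vectors of p-values can be computed as in the following sketch. We assume here that the coefficient reported in Figure 1 is Kendall's tau; the version below (tau-a, which ignores ties) and the toy p-values are illustrative only and do not use the estrogen data.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up p-values from two hypothetical approaches for four genes.
p_mean = [0.01, 0.20, 0.03, 0.50]
p_median = [0.02, 0.04, 0.05, 0.40]
tau = kendall_tau(p_mean, p_median)
print(round(tau, 4))
```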

7. Discussion. The main results of this paper are that marginal (gene specific)
estimates and asymptotic-based p-values are uniformly consistent in microarray
experiments with n replications, regardless of the dependencies between genes,
provided the number of genes $p_n$ satisfies $\log p_n = o(n)$, $\log p_n = o(n^{1/2})$ or
$\log p_n = o(n^{1/3})$, depending on the desired task. In other words, the number of genes is
allowed to increase almost exponentially fast relative to the number of arrays. This
seems to be a realistic asymptotic regime for microarray studies. These results also
hold true for two-sample comparisons. Moreover, the results continue to hold even
after normalization, provided the normalization process is sufficiently accurate.

We note that the simulation and data analyses seem to support the theoretical
results of the paper, although some test procedures appear to work better than
others. We also acknowledge that a number of important issues, such as the effect
of marginal distribution on the asymptotics and the effect of normalization, were
not evaluated in the limited simulation studies presented in section 6. A refined and
more thorough simulation study that addresses these points is beyond the scope of
the current paper but is worth pursuing in the future.

A theoretical limitation of the present study is that the asymptotics developed 
are not yet accurate enough to provide precise guidelines on sample size for specific 
microarray experiments. The development of such guidelines is worthwhile to pursue 
as a future topic, but it most likely would require at least some assumptions on the 
dependencies between genes. Such assumptions are out of place in the present paper 
since a strength of the paper is the absence of assumptions on gene interdependence. 
It is because of this generality that we believe the results of this paper should be a 
useful point of departure for future, more refined asymptotic analyses of microarray 
experiments.

8. Proofs.

Proof of theorem 1. Define $V_{j(n)} = \sqrt{n}\,\|\hat{F}_{j(n)} - F_{j(n)}\|_\infty$, and note that by theorem 4
below combined with lemma 2.2.1 of van der Vaart and Wellner (1996) (abbreviated
VW hereafter), $\|V_{j(n)}\|_{\psi_2} \le \sqrt{3/2}$ for all $1 \le j \le p_n$. Now, by lemma 2.2.2
of VW combined with the fact that $\limsup_{x,y \to \infty} \psi_2(x)\psi_2(y)/\psi_2(xy) = 0$, we have
that there exists a universal constant $c_* < \infty$ such that
$\|\max_{1 \le j \le p_n} V_{j(n)}\|_{\psi_2} \le c_* \sqrt{\log(1 + p_n)}\,\sqrt{3/2}$ for all $n \ge 1$.
The desired result now follows for the constant
$c_0 = \sqrt{6}\,c_*$, since $\log(k + 1) \le 2\log k$ for any $k \ge 2$. □

Theorem 4 Let $Y_1, \ldots, Y_n$ be an i.i.d. sample of real random variables with
distribution G (not necessarily continuous), and let $G_n$ be the corresponding empirical
distribution function. Then

\[ P\Bigl(\sup_{t \in \mathbb{R}} \sqrt{n}\,|G_n(t) - G(t)| > x\Bigr) \le 2e^{-2x^2}, \]

for all $x \ge 0$.
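A quick Monte Carlo check can make the content of this bound concrete in the discontinuous case. The sketch below uses a discrete uniform distribution on {1, ..., 5}, a choice made purely for illustration, and compares the empirical frequency of the event $\sqrt{n}\sup_t |G_n(t) - G(t)| > x$ with $2e^{-2x^2}$. The function names, sample size and number of replications are hypothetical.

```python
import math
import random


def sup_deviation_discrete(sample, support, cdf):
    """sup_t |G_n(t) - G(t)| when G is supported on finitely many atoms.

    Both G_n and G are constant between atoms, so the supremum is attained
    at one of the support points.
    """
    n = len(sample)
    worst = 0.0
    for t, g in zip(support, cdf):
        g_n = sum(1 for y in sample if y <= t) / n
        worst = max(worst, abs(g_n - g))
    return worst


def tail_frequency(n, x, reps, rng):
    """Empirical frequency of sqrt(n) * sup|G_n - G| > x over `reps` samples."""
    support = [1, 2, 3, 4, 5]
    cdf = [0.2, 0.4, 0.6, 0.8, 1.0]
    hits = 0
    for _ in range(reps):
        sample = [rng.randint(1, 5) for _ in range(n)]
        if math.sqrt(n) * sup_deviation_discrete(sample, support, cdf) > x:
            hits += 1
    return hits / reps


rng = random.Random(0)
results = {}
for x in (0.5, 1.0, 1.5):
    results[x] = (tail_frequency(n=50, x=x, reps=2000, rng=rng),
                  2 * math.exp(-2 * x * x))
    print(x, results[x])
```

For a discrete G the observed frequencies typically fall well below the bound, which is consistent with the continuous case being the extreme one.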



Proof. This is the celebrated result of Dvoretzky, Kiefer and Wolfowitz (1956),
given in their lemma 2, as refined by Massart (1990) in his corollary 1. We omit
the proof of their result but note that their result applies to the special case
where G is continuous. We now show that it also applies when G may be
discontinuous. Without loss of generality, assume that G has discontinuities, and
let $T_1, \ldots, T_m$ be the locations of the discontinuities of G, where m may be
infinity. Note that the number of discontinuities can be at most countable. Let
$r_1, \ldots, r_m$ be the jump sizes of G at $T_1, \ldots, T_m$. Now let $U_1, \ldots, U_n$ be i.i.d. uniform
random variables independent of the $Y_1, \ldots, Y_n$, and define new random variables
$Z_i = Y_i + \sum_{j=1}^m r_j \left[ 1\{T_j < Y_i\} + 1\{T_j = Y_i\} U_i \right]$, $1 \le i \le n$. Define also the
transformation $t \mapsto R(t) = t + \sum_{j=1}^m r_j 1\{T_j \le t\}$; let $H_n$ be the empirical distribution of
$Z_1, \ldots, Z_n$; and let H be the distribution of $Z_1$. It is not hard to verify that

\[ \sup_{t \in \mathbb{R}} |G_n(t) - G(t)| = \sup_{t \in \mathbb{R}} |H_n(R(t)) - H(R(t))| \le \sup_{s \in \mathbb{R}} |H_n(s) - H(s)|, \]

and the desired result now follows since H is continuous. □

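Proof of theorem 2. Let $U_{ij(n)}$, $i = 0, \ldots, n$ and $j = 1, \ldots, p_n$, be independent
uniform random variables. Then, by theorem 5 below, there exist Brownian
bridges $B_{1(n)}, \ldots, B_{p_n(n)}$, where, for each $1 \le j \le p_n$, $B_{j(n)}$ depends only on
$X_{1j(n)}, \ldots, X_{nj(n)}$ and $U_{0j(n)}, \ldots, U_{nj(n)}$ and

\[ P\Bigl(\sqrt{n}\,\bigl\|\sqrt{n}(\hat{F}_{j(n)} - F_{j(n)}) - B_{j(n)}(F_{j(n)})\bigr\|_\infty > x + 12\log n\Bigr) \le 2e^{-x/6}, \tag{22} \]

for all $x \ge 0$ and all $n \ge 2$. Now define

\[ U_{j(n)} = \left( \frac{\sqrt{n}}{\log n}\,\bigl\|\sqrt{n}(\hat{F}_{j(n)} - F_{j(n)}) - B_{j(n)}(F_{j(n)})\bigr\|_\infty - 12 \right)^{+}, \]

where $u^{+}$ is the positive part of u. By lemma 2.2.1 of VW, expression (22) implies
that $\|U_{j(n)}\|_{\psi_1} \le 18/\log n$. Reapplying the result that $\log(k + 1) \le 2\log k$ for any
$k \ge 2$, we now have, by the fact that $\limsup_{x,y \to \infty} \psi_1(x)\psi_1(y)/\psi_1(xy) = 0$ combined
with lemma 2.2.2 of VW, that there exists a universal constant $0 < c_2 < \infty$ for which

\[ \Bigl\| \max_{1 \le j \le p_n} U_{j(n)} \Bigr\|_{\psi_1} \le \frac{c_2 \log p_n}{\log n}. \]

Now the conclusion of the theorem follows, for $c_1 = 12$, from the definition of $U_{j(n)}$. □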

Theorem 5 For $n \ge 2$, let $Y_1, \ldots, Y_n$ be i.i.d. real random variables with
distribution G (not necessarily continuous), and let $U_0, \ldots, U_n$ be independent uniform
random variables independent of $Y_1, \ldots, Y_n$. Then there exists a standard Brownian
bridge B depending only on $Y_1, \ldots, Y_n$ and $U_0, \ldots, U_n$ such that, for all $x \ge 0$,

\[ P\Bigl(\sqrt{n}\,\bigl\|\sqrt{n}(G_n - G) - B(G)\bigr\|_\infty > x + 12\log n\Bigr) \le 2e^{-x/6}, \tag{23} \]

where $G_n$ is the empirical distribution of $Y_1, \ldots, Y_n$.



Proof. We will apply the same method for handling the discontinuities of G
as used in the proof of theorem 4. Let $m \ge 0$, $T_1, \ldots, T_m$, and $r_1, \ldots, r_m$ be as
defined in the proof of theorem 4. Similarly define $Z_1, \ldots, Z_n$, R, $H_n$ and H, except
that we will utilize the uniform random variables $U_1, \ldots, U_n$ given in the statement
of theorem 5. By the continuity of H as established in the proof of theorem 4,
$H(Z_i)$ is now uniformly distributed. Thus, by the Hungarian construction theorem
(theorem 1) of Bretagnolle and Massart (1989), there exists a Brownian bridge B
depending only on $Z_1, \ldots, Z_n$ and $U_0$ such that

\[ P\Bigl(\sqrt{n}\,\sup_{s \in \mathbb{R}} \bigl|\sqrt{n}(H_n(s) - H(s)) - B(H(s))\bigr| > x + 12\log n\Bigr) \le 2e^{-x/6}, \]

for all $x \ge 0$. The desired result now follows since

\[ \sup_{t \in \mathbb{R}} \bigl|\sqrt{n}(G_n(t) - G(t)) - B(G(t))\bigr| = \sup_{t \in \mathbb{R}} \bigl|\sqrt{n}(H_n(R(t)) - H(R(t))) - B(H(R(t)))\bigr| \le \sup_{s \in \mathbb{R}} \bigl|\sqrt{n}(H_n(s) - H(s)) - B(H(s))\bigr|. \]

□



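Proof of corollary 1. The result is a consequence of theorem 1 via the following
integration by parts identity:

\[ \int_{[a_{j(n)}, b_{j(n)}]} x \,\bigl[ d\hat{F}_{j(n)}(x) - dF_{j(n)}(x) \bigr] = -\int_{[a_{j(n)}, b_{j(n)}]} \bigl[ \hat{F}_{j(n)}(x) - F_{j(n)}(x) \bigr]\, dx. \tag{24} \]

□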



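Proof of corollary 2. Note that for any $x \in \mathbb{R}$ and any $y > 0$,

\[ |\Phi(xy) - \Phi(x)| \le \sup_{1 \wedge y \le u \le 1 \vee y} |x|\,\phi(xu)\,|y - 1| \le 0.25 \times \sup_{1 \wedge y \le u \le 1 \vee y} \frac{|y - 1|}{u} \le 0.25 \times \Bigl( |1 - y| \vee \Bigl|1 - \frac{1}{y}\Bigr| \Bigr). \]

The constant 0.25 comes from the fact that $\sup_{u > 0} u\phi(u) \le (2\pi e)^{-1/2} \le 0.25$. Thus

\[ \max_{1 \le j \le p_n} |\hat{\pi}_{j(n)} - \pi^*_{j(n)}| \le \frac{1}{2} \max_{1 \le j \le p_n} \frac{(\hat{\sigma}_{j(n)} \vee \sigma_{j(n)})\,|\hat{\sigma}_{j(n)} - \sigma_{j(n)}|}{\hat{\sigma}_{j(n)}\,\sigma_{j(n)}}, \tag{25} \]

where $\pi^*_{j(n)} = 2\Phi(-|T^*_{j(n)}|)$ and $T^*_{j(n)} = \sqrt{n}(\bar{X}_{j(n)} - \mu_{0,j(n)})/\sigma_{j(n)}$.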

Now the integration by parts formula 1)24^ combined with theorem |21 yields 



max 



+ 



a 



' Ci log n + C2 log p„ \ I hj{n) - aj^ri) I 

< I — I max — — 



n 



i<i<Pn Crj(„) 



where ci, C2 and i?i („),... , Bp^(^n) are as given in theorem |21 and where 



aj('i).''j(n)] 



is standard normal for all $1 \le j \le p_n$. This, combined with the fact that
$|\Phi(x) - \Phi(y)| \le |x - y|/2$ for all $x, y \in \mathbb{R}$, yields the desired result. □

Proof of corollary\^ That the left-hand-side of © is op(l) follows from condi- 
tion (jH)) combined with theorem ^ By the definition of the sample median, we have 
that Fj(n){(,j(n)) ~ ^j(n){Cj(n)) = ^j(n)^ where <J 1/n. This now implies that 

the mean value theorem and condition lIHll.D 



Ej(,fi)- The result now follows from 



Proof of Corollary^ Now, for some ^*^^^ in between ^j^^) and Cj(n)) we have 

■/'i(ri)(^i(n))(4(n) ~ ^i{n)) = - Pj(n){ij{n)) + ^i(n) (^i(n) ) • Using the Conditions of the 
corollary, we obtain that the fj{n) terms are simultaneously consistent for the quan- 
tities fj{n){i*j{n)} a.nd that these later quantities are bounded above and below. Now 
we can argue as in the first part of the proof of corollary |21 that maxi<j<p,j \^'j{n) ~ 
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^j(n)l = o^'(l)' where = 2$(-|?7j(„)|) and 

= -2y/n{Fjf^ri){ij(n)) " ^i(n) (Cj(ri) )) + 2\/n/i(n) ) ~ ^0,j(n))- 

Note that 



" (^j(n)(li(n)) - ^i(n)(?i(n)) " ^i(n)(^j(n)) + ^i(n) (Ci(n) )^ 



where Cj(„) = ^/nEj(^n) ^j{n) defined in the proof of corollaryElwith l-Ej^^)! < 
1/n. Hence Cj(„) vanishes asymptotically, uniformly over 1 < j < Pn- Theorem |2 
tells us that we can, uniformly over 1 < j < Pn, replace ^j(n) and V^(n) with = 
Bj{n){Fj(n){ij(n))) - Bj{n){Fj{n){^j{n))) and V^^^^ = Bj(„)(l/2). Note that = 
25j(„)(l/2) are standard normals and and that Bj^^j^-^{t) = — iWj(„)(l), for 

all t E [0,1], for some standard Brownian motions Wj^^)- Thus, by the symmetry 
properties of Brownian motion, 



< a/5 



sup sup 

0<t<<5,(„) 0<t<5j(„) 



^j{n)\Wj{n){'^)\ = ^i(n)('5j(n))i 



where (5j(„) = M|^j(„) — '^j{n)|; is as defined in ©; and where VFj(n)) ^j'(n) 
W't ^ are Brownian motions. 

Now, for each k < oo and p > 0, we have 



< P ( max Aj(^n)(.krn) > p] + P ( max > A;r„ 
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where r„ = log(n V Pn)/n + y^logp^/n. However, using the facts that a stan- 
dard normal deviate and the supremum of the absolute value of a Brownian mo- 
tion over [0, 1] both have sub-Gaussian tails (i.e., have bounded i/^2-iiorms), we 
have maxi<j<p,^ lj(„)(A;r„) < Op (Vlogp„ [r„ + ^ 0, in probability, since 

Xo^Pn/n 0. Thus the first term on the right-hand-side of 1)261) goes to zero. 
Since corollary |21 implies lim^^oo lim sup^^^^^ P ^maxi<j<p^ (5j(„) > A;r„^ = 0, the 
left-hand-side of (f^^ also goes to zero as n ^ cxd. Thus t^j(n) can be approximated 
by = Zj(_n) + '^Vrifjin){^j(n))(^j{n) - ^ojin)) simultaneously over all 1 < j < p„. 

Now we can use arguments given at the beginning of the proof of corollary [21 
(again) in combination with the simultaneous consistency of (,j(n) ^he assumed 
properties of to obtain that 



max 



^(-\U'jin)\) - ^ 



fj{n)iCj(n)) 



fj{n) i^j, 



Zj{n) + '2V^fj(n){Cj(n)){S.j{n) " Co,j(n)) 



j{n)) 



op(l). Now define rjn = maxi<j<p„ l?i(n) ~ij(n)\- By condition ((TT|). we have that 



max 

i<i<Pn 



/j(n)(gj(n)) _ ^ 
fj{n)ii*j(n)> 



X < Op [ ^maxjZ,(„) 



X max suD sud l^^'wC^iW + ^) " /j(n)(gj-(n))l 1/2 
X max sup sup , r/^ 



< Op(Vlogp„) X r/^ 



1/2 



Op I Vlog]5„ X 

0, 



' log n V Pn _^ /logPn 



n 



where the equality follows from corollary |2l 
Proof of corollary The proof follows 
corollary |2l Using the fact that, for any x 



The desired result now follows. □ 

the same general logic as the proof of 

G M and any y > 0, \^{xy) - (^{x)\ < 
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0.25 x|l — y|v|l — y 'we have 



< max — 

i<j<Pii 2 



rii 



a 



(2) 

i(™) 



+ n2 



a 



(1) 



2\ 1/2 



V 



(2) 



2 



(T 



(2) 
"-2 



+ n2 



(1) 



(1) 



2\ 1/2 



(2) 



+ n2 



(1) 




whereTT-^) =2$(-|r*;^|) and 



^2 



(1) 



1/2 



ni 



(2) 



+ ^2 



a 



ni 



a 



(2) 



(1) 

2 



(1) 



1/2 



^2 - fl 



(2) _ (2) 



ni 



£7 



(2) 



+ n2 



(1) 



1 



nin2 



Jin) 



+ n2 



2 l^i(n) 



Now, virtually identical Brownian bridge approximation arguments to those used in 
the proof of corollary [21 yield that 



max 

l<i<Pn 



Ci lognfc + C2 logpn 



fc=l,2 



max 

i<i<Pn 



0" 



(fe) 



In order to finish the proof, we need to bound the right-hand-side of (|27|) . To 
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begin with, note that for any scalars ci , C2 , di , (^2 > 0, 

.2 \ 1/2 



\nidl + 77-2^1 y 



1 



< 



+ 



/ nicl + n2cf \ _ / ?T4c|_+n2^\ 

^ niC2 + n2(if ^ / 71,1(^2 + ^2^2 



1/5 



< 



\n1d2 + n2di J \nid2 + n2di 

n2d\ ^ 



+ 



711^2 + n2df 
2 



di 



< 



+ 



^-1 
d2 



where the second inequahty follows from the fact that for any a,b,x,y > 0, 



, ax 



Hence both 



and 







2 

+ n2 




\ni 


r^(2) 


2 

+ n2 




^ n\ 


^(2) 1 


2 

+ n2 








2 

+ n2 





2\ 1/2 



< 



(1) 



(7 



(1) 



+ 



2\ 1/2 



< 



(1) 



0" 



(1) 



+ 



^(2) 



^(2) 



(2) 



and thus the right-hand-side of ()27() is bounded by 



,(A:) (fe) 
max I (T •/^ V (7 ./\ 



(T 



(fc) 
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completing the proof. □ 

Proof of corollary The proof consists of extending the proof of corollary 
in a manner similar to the way in which the proof of corollary [21 was extended for 



proving corollary [51 A key difference is that the role of CTj^fl^-^ and <5'j^^-) in the proof 



,(fc) 



of corollary [SI is replaced by and for k = 1,2 and 1 < j < Pn- The 

remaining necessary extensions of the proof of corollary [31 are straightforward. □ 

Proof of lemma 1. Because of the invariance under monotone transformation,
we can assume without loss of generality that the data are uniformly distributed.
Classical arguments in Billingsley (1968) yield the second result. In particular, the
form of the limiting distribution function, which is the distribution of the supremum
in absolute value of a Brownian bridge, can be found on page 85 of Billingsley.
Arguments for establishing the remaining two results can be found in section 3.9.4
(for the Wilcoxon statistic) and in section 2.13.2 (for the Cramér-von Mises statistic)
of van der Vaart and Wellner (1996). □



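Proof of theorem 3. Define $\tilde{E}_n = \max_{1 \le j \le p_n} \|\tilde{F}_{j(n)} - \hat{F}_{j(n)}\|_\infty$ and, for each $\delta \ge 0$,
$\hat{E}_n(\delta) = \max_{1 \le j \le p_n} \sup_{|s - t| \le \delta} |\hat{F}_{j(n)}(s) - \hat{F}_{j(n)}(t)|$. Suppose now that for some
positive, non-increasing sequences $\{s_n, \delta_n\}$, with $\delta_n \to 0$, we have $\hat{E}_n(\delta_n) = o_P(s_n)$
and $P(\hat{\epsilon}_n > \delta_n) = o(1)$. Then, by the definition of $\hat{\epsilon}_n$,

\[ \tilde{E}_n = \tilde{E}_n 1\{\hat{\epsilon}_n \le \delta_n\} + \tilde{E}_n 1\{\hat{\epsilon}_n > \delta_n\} \le \hat{E}_n(\delta_n) + o_P(s_n) = o_P(s_n). \tag{28} \]

Now, by theorem 2 and condition (21), we have for any sequence $\delta_n \downarrow 0$,

\[ \sqrt{n}\,\hat{E}_n(\delta_n) \le \max_{1 \le j \le p_n} \sup_{|s - t| \le \delta_n} \sqrt{n}\,\bigl| \hat{F}_{j(n)}(s) - F_{j(n)}(s) - \hat{F}_{j(n)}(t) + F_{j(n)}(t) \bigr| + \sqrt{n} M \delta_n \]
\[ \le \max_{1 \le j \le p_n} \sup_{|s - t| \le \delta_n} \bigl| B_{j(n)}(F_{j(n)}(s)) - B_{j(n)}(F_{j(n)}(t)) \bigr| + O_P\left( \frac{\log n + \log p_n}{\sqrt{n}} + \sqrt{n}\,\delta_n \right). \]

Combining this with a reapplication of condition (21) along with lemma 2 below (a
precise modulus of continuity bound for Brownian motion), we obtain

\[ \sqrt{n}\,\hat{E}_n(\delta_n) \le O_P\left( \sqrt{(\log p_n)\,\delta_n \log(1/\delta_n)} + \frac{\log n + \log p_n}{\sqrt{n}} + \sqrt{n}\,\delta_n \right). \tag{29} \]

Both (28) and (29) will prove useful at several points in our proof.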

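Using the fact that $\hat{\epsilon}_n = o_P(1)$, we can find a positive, sufficiently slowly
decreasing sequence $\delta_n \to 0$ such that $\hat{\epsilon}_n = o_P(\delta_n)$. Now, by applying (28) with
$s_n = 1$, we obtain result (i) of the theorem: $\tilde{E}_n = o_P(1)$. For result (ii), we can
use the fact that $\log^2 p_n / n = o(1)$ to construct a positive, non-decreasing sequence
$r_n \to \infty$ slowly enough so that $r_n \log p_n / \sqrt{n} = o(1)$ and $r_n / \log n = o(1)$. Since
$\sqrt{n}(\log n)\hat{\epsilon}_n = O_P(1)$, we have

\[ \sqrt{n}\,\hat{\epsilon}_n \log(1/\hat{\epsilon}_n) = \sqrt{n}(\log n)\hat{\epsilon}_n \times \frac{\log\sqrt{n} - \log(\sqrt{n}\,\hat{\epsilon}_n)}{\log n} = O_P(1). \]

Thus, if we set $\delta_n = r_n / (\sqrt{n}\log n)$, we have $\hat{\epsilon}_n = o_P(\delta_n)$. We also have, by (29),
that

\[ \hat{E}_n(\delta_n) = O_P\left( \sqrt{\frac{1}{n} \times \frac{r_n \log p_n}{\sqrt{n}} \times \frac{\log\sqrt{n} + \log\log n - \log r_n}{\log n}} + \frac{\log n + \log p_n}{n} + \frac{r_n}{\sqrt{n}\log n} \right) = o_P(n^{-1/2}). \]

The proof is now complete by reapplying (28) with the choice $s_n = n^{-1/2}$. □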

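Lemma 2 Let $W : [0, 1] \to \mathbb{R}$ be a standard Brownian motion. Then there exists
a universal constant $k_0 < \infty$ such that

\[ \Bigl\| \sup_{|s - t| \le \delta} |W(s) - W(t)| \Bigr\|_{\psi_2} \le k_0 \sqrt{\delta \log(1/\delta)}, \]

for all $0 < \delta \le 1/2$.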



Proof. Fix $\delta \in (0, 1)$. Let $n_\delta$ be the smallest integer $\ge 1 + 1/\delta$, and extend the
Brownian motion W to the interval $[0, \delta n_\delta]$. Now

\[ \sup_{|s - t| \le \delta} |W(s) - W(t)| \le \max_{1 \le j \le n_\delta} \sup_{(j-1)\delta \le s < t \le (j+1)\delta} |W(s) - W(t)| \tag{30} \]
\[ \le 2 \max_{1 \le j \le n_\delta} \sup_{t \in [(j-1)\delta,\,(j+1)\delta]} |W(t) - W((j-1)\delta)| \]
\[ \le 2 \max_{1 \le j \le n_\delta} \sup_{t \in [0, 1]} \sqrt{2\delta}\,|W^*_j(t)|, \]

where $W^*_1, \ldots, W^*_{n_\delta}$ are a dependent collection of standard Brownian motions. The
last inequality follows from the symmetry properties of Brownian motion. We can
now use the fact that the tail probabilities of the supremum over [0, 1] of the absolute
value of Brownian motion are sub-Gaussian (and thus have bounded $\psi_2$-norms) to
obtain that the $\psi_2$-norm of the left side of (30) is bounded by
$k_* 2\sqrt{2\delta \log(1 + n_\delta)} \le k_* 2\sqrt{2\delta \log(3 + 1/\delta)} \le k_0 \sqrt{\delta \log(1/\delta)}$,
where $k_0 = 5k_*$ does not depend on $\delta$. The
last inequality follows because $\log(3 + 1/\delta)/\log(1/\delta) \le 1 + \log(1 + 3\delta)/\log(1/\delta) \le 3$
for all $\delta \in (0, 1/2]$. □
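The elementary inequality used in the last step of this proof is easy to confirm numerically. The sketch below evaluates the ratio $\log(3 + 1/\delta)/\log(1/\delta)$ on a grid of points in (0, 1/2] and also checks that $2\sqrt{2}\sqrt{3} \le 5$, which is how we read the constant relating $k_0$ to $k_*$. The grid and the variable names are illustrative choices.

```python
import math


def log_ratio(delta):
    """log(3 + 1/delta) / log(1/delta), defined for 0 < delta < 1."""
    return math.log(3 + 1 / delta) / math.log(1 / delta)


# Grid of 1000 points in (0, 1/2], including the endpoint 1/2.
grid = [0.5 * (k / 1000) for k in range(1, 1001)]
worst = max(log_ratio(d) for d in grid)

# Factor 2 * sqrt(2) from the bound, times sqrt(3) from the ratio bound.
constant = 2 * math.sqrt(2) * math.sqrt(3)
print(worst, constant)
```

Since the ratio equals $1 + \log(1 + 3\delta)/\log(1/\delta)$, it is increasing in $\delta$ on this range, so its largest value on the grid occurs at $\delta = 1/2$.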

Proof of corollary ^ Result (i) follows directly from part (i) of theorem |21 and 
theorem ^ Result (ii) is a direct consequence of part (i) of theorem |31 and a minor 
modification of the integration by parts identity (|24() used in the proof of corollary |21 
The proof of result (iii) is a straightforward extension of the proof of corollary 
which incorporates the conclusion of part (i) of theorem |5Jn 

Proof of corollary \^ For result (i), we use part (ii) of theorem |S1 combined with 
integration by parts to obtain that 



^^fpj^j{n)-^j{n)\=0p[n 



max \bi 



i<i<Pn 



+ 2e" 



Op(l). 



Now corollary n gives us the desired results since 



logPn 



n i<i<p 



max |6j(„) - aj(„)| 



'logPn 



max n 



n i<i<pr, 



^''^|fej(n) - aj{n)\ = 0{l). 



For result (ii), we also use part (ii) of theorem combined with integration by parts 
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to obtain 



max 



< Op max = op(l 

\l<j<Pn aj(^n) J 



and the desired result follows using the Brownian bridge approximation of ^/n (-^j(n) 
—t^j{n)) /^j{n) given in the proof of corollary[21 For result (iii), the desired conclusion 
is obtained via part (ii) of theorem |31 combined with a straightforward adaptation 
of the proof of corollary 0Jn 

Proof of corollary 9. The proof follows almost immediately from applying part (ii)
of theorem 3 to each sample separately, yielding the result

\[ \max_{k = 1, 2} \max_{1 \le j \le p_n} \sqrt{n_k}\,\bigl\| \tilde{F}^{(k)}_{j(n)} - \hat{F}^{(k)}_{j(n)} \bigr\|_\infty = o_P(1). \]

Now, the proofs of results (i) and (ii) are direct extensions of the one-sample results
of corollary 8 combined with straightforward adaptations of arguments found in the
proofs of corollaries 5 and 6. The proof of result (iii) also follows almost immediately.
For the Kolmogorov-Smirnov statistic, the result is obvious. For the other two
statistics, the result follows with some help from integration by parts. □

Table 1. One-sample simulation study results for the mean, median and signed rank
statistics under models 1, 2 and 3. Tot.: total count identified using FDR. Pos.:
number of true positives identified using FDR. EFDR: empirical FDR.

               Mean                  Median                Signed rank
Model    Tot. (Pos.)    EFDR    Tot. (Pos.)    EFDR    Tot. (Pos.)    EFDR

                              Sample size = 20
1        64.7 (33.9)    0.47    31.8 (25.4)    0.19    15.5 (15.5)    0.01
2        64.4 (33.9)    0.47    31.6 (25.3)    0.19    15.3 (15.2)    0.01
3        64.0 (33.9)    0.46    31.1 (25.0)    0.19    15.2 (15.1)    0.01

                              Sample size = 50
1        54.2 (37.8)    0.30    38.7 (32.9)    0.15    34.5 (34.0)    0.01
2        53.7 (37.4)    0.29    38.5 (32.7)    0.14    34.2 (33.8)    0.01
3        52.3 (37.5)    0.27    38.2 (32.5)    0.1[?]  31.4 (33.9)    0.01

Table 2. Two-sample simulation study results for mean, median, Wilcoxon and
Kolmogorov-Smirnov (KS) statistics under model 4. Tot.: total count identified
using FDR. Pos.: number of true positives identified using FDR. EFDR: empirical
FDR.

                 Mean                  Median                Wilcoxon              KS
n1 = n2    Tot. (Pos.)    EFDR    Tot. (Pos.)    EFDR    Tot. (Pos.)    EFDR    Tot. (Pos.)    EFDR

10         47.3 (21.5)    0.54     8.4 (6.7)     0.18    14.4 (13.1)    0.08     2.6 (2.4)     0.08
30         40.9 (28.9)    0.28    21.2 (19.8)    0.06    32.0 (26.6)    0.16    23.7 (21.5)    0.09
60         43.4 (33.3)    0.22    29.7 (25.4)    0.14    39.4 (32.4)    0.17    32.1 (28.0)    0.12

Figure 1: Estrogen data. Scatter plots of p-values comparing the four approaches
(mean, median, Wilcoxon and KS). A lowess smoother is used to estimate the trend,
and the associated rank correlation coefficient (tau) is given above each panel.